-
Notifications
You must be signed in to change notification settings - Fork 1k
Adding try_append_value implementation to ByteViewBuilder
#8594
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
73faf99 to
8859ff7
Compare
| .map(u32::from_le_bytes) | ||
| .ok_or_else(|| { | ||
| ArrowError::InvalidArgumentError( | ||
| "String must be at least 4 bytes for non-inline view".to_string(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This error is unreachable as we checked that the value is longer than MAX_INLINE_VIEW_LEN (12 bytes) above.
| let offset = self.in_progress.len() as u32; | ||
| let offset: u32 = self.in_progress.len().try_into().map_err(|_| { | ||
| ArrowError::InvalidArgumentError(format!( | ||
| "In-progress buffer length {} exceeds u32::MAX", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-
I think the method can recover by starting a new in-progress buffer instead of returning an error here.
-
I am unsure if this error is even reachable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a new buffer would be allocated in the line immediately above this. Maybe we should do a checked add in let required_cap = self.in_progress.len() + v.len(); 🤔
To error here, we would need a usize that doesn't fit into a u32.. I think all platforms we care about have usize that is at least u32 (aka 32-bit architectures)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To error here, we would need a usize that doesn't fit into a u32.. I think all platforms we care about have usize that is at least u32 (aka 32-bit architectures)
I think that would be the opposite, a usize in a 64-bit arch wouldn't fit a u32? Anyway, I will review and update these changes over the weekend
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, you are right -- thank you
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am unsure if this error is even reachable.
I think this is right. I wasn't able to trigger this. For the panic to verify, it would need to skip the flush_in_progress() operation that happens when the buffer reaches the required capacity.
On top of that, the push_completed used by the flush_in_progress asserts on the block.len() (see below). Therefore, let offset: u32 = self.in_progress.len() wouldn't be reached:
arrow-rs/arrow-array/src/builder/generic_bytes_view_builder.rs
Lines 273 to 276 in d49f017
| fn push_completed(&mut self, block: Buffer) { | |
| assert!(block.len() < u32::MAX as usize, "Block too large"); | |
| assert!(self.completed.len() < u32::MAX as usize, "Too many blocks"); | |
| self.completed.push(block); |
I'm proceeding by reverting this the map_err in the offset initialization.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this @samueleresca
| let offset = self.in_progress.len() as u32; | ||
| let offset: u32 = self.in_progress.len().try_into().map_err(|_| { | ||
| ArrowError::InvalidArgumentError(format!( | ||
| "In-progress buffer length {} exceeds u32::MAX", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a new buffer would be allocated in the line immediately above this. Maybe we should do a checked add in let required_cap = self.in_progress.len() + v.len(); 🤔
To error here, we would need a usize that doesn't fit into a u32.. I think all platforms we care about have usize that is at least u32 (aka 32-bit architectures)
|
🤖 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @samueleresca and @ctsk -- this change makes sense to me.
I have kicked off some benchmark runs just to make sure this doesn't affect performance somehow. Assuming they look good I think we can merge this one in
|
🤖: Benchmark completed Details
|
|
🤖 |
|
🤖 |
|
🤖: Benchmark completed Details
|
|
🤖 |
|
🤖: Benchmark completed Details
|
|
🤖 |
|
🤖: Benchmark completed Details
|
|
🤖 |
|
🤖: Benchmark completed Details
|
# Which issue does this PR close? N/A # Rationale for this change not testing the correct length # What changes are included in this PR? remove * 8 as the length of the buffer is in bytes already # Are these changes tested? created tests to make sure they are failing before AND created tests that make sure that ceil is used for future changes # Are there any user-facing changes? Nope
…tArray` (apache#8627) # Which issue does this PR close? - Closes apache#8610 # Rationale for this change Since the fields of `VariantArray` impl `PartialEq`, this PR simply derives `PartialEq` for `VariantArray` out. Based off of apache#8625
…trings for error messages (apache#8636) # Which issue does this PR close? This is a small performance improvement for the thrift remodeling - Part of apache#5853. # Rationale for this change Some of the often-called methods in the thrift protocol implementation created `ParquetError` instances with a string message that had to be allocated and formatted. This formatting code and probably also some drop glue bloats these otherwise small methods and prevented inlining. # What changes are included in this PR? Introduce a separate error type `ThriftProtocolError` that is smaller than `ParquetError` and does not contain any allocated data. The `ReadThrift` trait is not changed, since its custom implementations actually require the more expressive `ParquetError`. # Are these changes tested? The success path is covered by existing tests. Testing the error paths would require crafting some actually malformed files, or using a fuzzer. # Are there any user-facing changes? The `ThriftProtocolError` is crate-internal so there should be no api changes. Some error messages might differ slightly.
df730fd to
a0e79a4
Compare
|
I reviewed the benchmarks and I conclude they don't show any meaningful performance change, as expected. Thanks again @samueleresca |
Which issue does this PR close?
concat_elements_utf8viewpanics with large buffer on 64bit machines datafusion#17857Rationale for this change
These changes add a safer version of
append_valueinByteViewBuilderthat handles panics calledtry_append_value. Datafusions will consume the API and handle the Result coming back from the function.What changes are included in this PR?
Are these changes tested?
The method is already covered by existing tests.
Are there any user-facing changes?
No breaking changes, as the original
append_valuemethod hasn't changed.